Exploratory Data Analysis of The Vancouver Street Trees Dataset
1. Exploratory Data Analysis of The Vancouver Street Trees Dataset¶
This report was prepared by Sarah McDonald on December 12, 2021, as the final project for a Data Visualization class at the University of British Columbia. It uses the provided subset of the Vancouver Street Trees data [].
Fig. 1.1 Street trees in Vancouver¶
# Import libraries needed for this analysis
import pandas as pd
import altair as alt
import json
pandas [] is used to handle data, altair [] is a package used for graphing, and json [] is used to create maps.
# Load in the data and view a subset
trees_url = 'https://raw.githubusercontent.com/UBC-MDS/data_viz_wrangled/main/data/Trees_data_sets/small_unique_vancouver.csv'
trees_df = pd.read_csv(trees_url, parse_dates=['date_planted'])
trees_df.head()
| Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | plant_area | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | 15 | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | 7 | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | 7 | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | 7 | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | N | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 |
5 rows × 21 columns
# get more information about our dataset
trees_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Unnamed: 0 5000 non-null int64
1 std_street 5000 non-null object
2 on_street 5000 non-null object
3 species_name 5000 non-null object
4 neighbourhood_name 5000 non-null object
5 date_planted 2363 non-null datetime64[ns]
6 diameter 5000 non-null float64
7 street_side_name 5000 non-null object
8 genus_name 5000 non-null object
9 assigned 5000 non-null object
10 civic_number 5000 non-null int64
11 plant_area 4950 non-null object
12 curb 5000 non-null object
13 tree_id 5000 non-null int64
14 common_name 5000 non-null object
15 height_range_id 5000 non-null int64
16 on_street_block 5000 non-null int64
17 cultivar_name 2658 non-null object
18 root_barrier 5000 non-null object
19 latitude 5000 non-null float64
20 longitude 5000 non-null float64
dtypes: datetime64[ns](1), float64(3), int64(5), object(12)
memory usage: 820.4+ KB
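The summary above already quantifies the gap in the date_planted column. As a quick worked check, the sketch below uses only the two counts printed by info() (5000 entries, 2363 non-null dates) to compute the share of missing dates. On the real data frame, something like trees_df['date_planted'].isna().mean() should give the same figure, although that call is not run here.

```python
# Counts copied from the trees_df.info() summary above
total_rows = 5000
non_null_dates = 2363

# Rows with no planting date, as a count and as a share of all rows
missing_dates = total_rows - non_null_dates
missing_share = missing_dates / total_rows

print(f"{missing_dates} of {total_rows} rows lack a planting date ({missing_share:.1%})")
```

So a little over half of the entries have no planting date.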
2. Questions of Interest¶
For this analysis I am interested in how the number and type of trees planted has changed over time. From our initial look at the data, I can see that a lot of values are missing from the ‘date_planted’ column. This could be an error in data recording or it could be that we don’t have records of when older trees were planted. To visualize the gaps in our data, let’s first plot the dates we do have.
# rug plot to visualize date_planted column data
trees_date = alt.Chart(trees_df).mark_tick().encode(
alt.X("date_planted:T", scale=alt.Scale())
)
trees_date
It looks like we have continuous data from 1989-2019. If our theory is correct and data without values in the ‘date_planted’ column is from older trees, we could expect these trees to be larger than trees planted more recently. Let’s see if that holds true for our data.
To make the data easier to filter I will use the pandas [] package to add a new column to our data frame. A simple boolean flags whether the date planted is missing for that entry: True means no date was recorded.
# add a boolean column to our dataframe: True where date_planted is missing
trees_nan = trees_df.assign(date_record = trees_df['date_planted'].isna())
trees_nan.head()
| Unnamed: 0 | std_street | on_street | species_name | neighbourhood_name | date_planted | diameter | street_side_name | genus_name | assigned | ... | curb | tree_id | common_name | height_range_id | on_street_block | cultivar_name | root_barrier | latitude | longitude | date_record | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10747 | W 20TH AV | W 20TH AV | PLATANOIDES | Riley Park | 2000-02-23 | 28.5 | EVEN | ACER | N | ... | Y | 21421 | NORWAY MAPLE | 4 | 0 | NaN | N | 49.252711 | -123.106323 | False |
| 1 | 12573 | W 18TH AV | W 18TH AV | CALLERYANA | Arbutus-Ridge | 1992-02-04 | 6.0 | ODD | PYRUS | N | ... | Y | 129645 | CHANTICLEER PEAR | 2 | 2300 | CHANTICLEER | N | 49.256350 | -123.158709 | False |
| 2 | 29676 | ROSS ST | ROSS ST | NIGRA | Sunset | NaT | 12.0 | ODD | PINUS | N | ... | Y | 154675 | AUSTRIAN PINE | 4 | 7800 | NaN | N | 49.213486 | -123.083254 | True |
| 3 | 8856 | DOMAN ST | DOMAN ST | AMERICANA | Killarney | 1999-11-12 | 11.0 | EVEN | FRAXINUS | N | ... | Y | 180803 | AUTUMN APPLAUSE ASH | 4 | 6900 | AUTUMN APPLAUSE | N | 49.220839 | -123.036721 | False |
| 4 | 21098 | EAST BOULEVARD | EAST BOULEVARD | HIPPOCASTANUM | Shaughnessy | NaT | 15.5 | ODD | AESCULUS | Y | ... | Y | 74364 | COMMON HORSECHESTNUT | 4 | 5200 | NaN | N | 49.238514 | -123.154958 | True |
5 rows × 22 columns
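A minimal sketch of how such a flag behaves, using a made-up three-row frame (the rows are invented for illustration and are not taken from the dataset). It builds the same kind of boolean column and then counts how many rows fall into each group.

```python
import pandas as pd

# Toy stand-in for trees_df; values are invented for illustration
toy = pd.DataFrame({
    "common_name": ["NORWAY MAPLE", "AUSTRIAN PINE", "RED MAPLE"],
    "date_planted": pd.to_datetime(["2000-02-23", None, "1999-11-12"]),
})

# True marks rows where date_planted is missing (NaT)
toy = toy.assign(date_record=toy["date_planted"].isna())

# Count rows with and without a planting date
flag_counts = toy["date_record"].value_counts().to_dict()
print(flag_counts)
```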
To account for differences in species we want to break the records down by species. First let’s see how many species we are working with.
species = trees_nan.groupby("species_name")
species.describe()
| Unnamed: 0 | diameter | ... | latitude | longitude | |||||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | std | min | 25% | 50% | 75% | max | count | mean | ... | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
| species_name | |||||||||||||||||||||
| ABIES | 3.0 | 11484.666667 | 7736.631718 | 4347.0 | 7374.00 | 10401.0 | 15053.5 | 19706.0 | 3.0 | 16.000000 | ... | 49.251689 | 49.265250 | 3.0 | -123.139497 | 0.082919 | -123.191800 | -123.187300 | -123.182800 | -123.113346 | -123.043891 |
| ACERIFOLIA X | 60.0 | 14736.833333 | 7736.247569 | 1152.0 | 8729.25 | 12926.0 | 21225.0 | 29978.0 | 60.0 | 22.355000 | ... | 49.263235 | 49.289708 | 60.0 | -123.117517 | 0.047075 | -123.198230 | -123.150238 | -123.122816 | -123.078775 | -123.030066 |
| ACUTISSIMA | 19.0 | 16161.631579 | 8395.660984 | 2483.0 | 11159.00 | 16611.0 | 23396.5 | 28798.0 | 19.0 | 11.355263 | ... | 49.263155 | 49.285991 | 19.0 | -123.087162 | 0.038076 | -123.166011 | -123.113721 | -123.089016 | -123.058271 | -123.028403 |
| ALNIFOLIA | 7.0 | 19888.285714 | 6129.725299 | 11189.0 | 15721.50 | 21692.0 | 24053.0 | 26788.0 | 7.0 | 7.642857 | ... | 49.271948 | 49.290517 | 7.0 | -123.086372 | 0.052193 | -123.157361 | -123.132622 | -123.055624 | -123.044008 | -123.038358 |
| ALPINUM | 1.0 | 7160.000000 | NaN | 7160.0 | 7160.00 | 7160.0 | 7160.0 | 7160.0 | 1.0 | 8.000000 | ... | 49.261980 | 49.261980 | 1.0 | -123.176110 | NaN | -123.176110 | -123.176110 | -123.176110 | -123.176110 | -123.176110 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| WATERERI X | 3.0 | 10674.000000 | 4363.166396 | 7523.0 | 8184.00 | 8845.0 | 12249.5 | 15654.0 | 3.0 | 18.833333 | ... | 49.247132 | 49.258560 | 3.0 | -123.137358 | 0.067524 | -123.209370 | -123.168305 | -123.127239 | -123.101351 | -123.075464 |
| X YEDOENSIS | 90.0 | 16544.900000 | 8492.142408 | 832.0 | 9711.75 | 17409.5 | 22845.5 | 29792.0 | 90.0 | 7.547222 | ... | 49.256834 | 49.289456 | 90.0 | -123.117258 | 0.057906 | -123.220360 | -123.166355 | -123.130054 | -123.058314 | -123.025868 |
| XX | 57.0 | 16790.192982 | 9060.538799 | 397.0 | 8913.00 | 18314.0 | 25920.0 | 29855.0 | 57.0 | 3.504386 | ... | 49.261244 | 49.289050 | 57.0 | -123.097158 | 0.050363 | -123.209720 | -123.137614 | -123.088452 | -123.060023 | -123.023650 |
| YUNNANENSIS | 1.0 | 5188.000000 | NaN | 5188.0 | 5188.00 | 5188.0 | 5188.0 | 5188.0 | 1.0 | 10.000000 | ... | 49.220989 | 49.220989 | 1.0 | -123.100972 | NaN | -123.100972 | -123.100972 | -123.100972 | -123.100972 | -123.100972 |
| ZUMI | 65.0 | 13045.923077 | 8931.664130 | 33.0 | 5494.00 | 9988.0 | 21052.0 | 29456.0 | 65.0 | 5.203846 | ... | 49.264133 | 49.285638 | 65.0 | -123.101302 | 0.058736 | -123.214080 | -123.157532 | -123.082970 | -123.055951 | -123.026981 |
171 rows × 64 columns
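The row count of the describe() output gives the number of species, but it also computes many statistics we do not need for this question. A lighter alternative, sketched here on a toy frame with invented values, is to count distinct species names directly with nunique().

```python
import pandas as pd

# Toy stand-in for trees_nan; values are invented for illustration
toy = pd.DataFrame({
    "species_name": ["PLATANOIDES", "NIGRA", "PLATANOIDES", "ZUMI"],
    "diameter": [28.5, 12.0, 10.0, 5.0],
})

# Number of distinct species names in the column
n_species = toy["species_name"].nunique()
print(n_species)
```

Applied to the real data, trees_nan["species_name"].nunique() should agree with the 171 rows reported above.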
3. Top 10¶
171 is a lot of species to visualize all at once. Let’s find our top 10 using the pandas package [] to group entries by their common name, count those entries, and sort from most to least common. Then we can filter so we see only the first 10 entries, the 10 most common trees planted!
#find the 10 most common trees in our dataset
trees_common = (trees_nan.groupby("common_name").count().sort_values(by='tree_id', ascending=False
).reset_index().loc[0:9])
trees_common = trees_common["common_name"].tolist()
trees_common
['KWANZAN FLOWERING CHERRY',
'PISSARD PLUM',
'NORWAY MAPLE',
'CRIMEAN LINDEN',
'PYRAMIDAL EUROPEAN HORNBEAM',
'NIGHT PURPLE LEAF PLUM',
'KOBUS MAGNOLIA',
'AKEBONO FLOWERING CHERRY',
'RED MAPLE',
'KATSURA TREE']
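The same group, count, and sort steps can be written more compactly with value_counts(), which counts each distinct value and sorts from most to least frequent by default. This is an alternative sketch on invented rows, not the code used for the list above.

```python
import pandas as pd

# Toy stand-in for trees_nan; values are invented for illustration
toy = pd.DataFrame({
    "common_name": ["NORWAY MAPLE"] * 3 + ["RED MAPLE"] * 2 + ["KATSURA TREE"],
})

# Keep the names of the two most frequent trees (the report keeps ten)
top_two = toy["common_name"].value_counts().head(2).index.tolist()
print(top_two)
```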
# filter trees_nan to include only the most common trees
common_records = trees_nan.common_name.isin(trees_common)
trees_nan_small = trees_nan[common_records]
# box plots of tree diameter for the most common trees, faceted by whether the date is missing
tree_diam = alt.Chart(trees_nan_small).mark_boxplot().encode(
alt.X('diameter:Q'),
alt.Y('common_name:N'),
).properties(width=300).facet('date_record')
tree_diam
As we can see from the chart above, trees without a date record (date_record is true) do have a higher median diameter than trees with a date record. Trees increase in circumference as they age. A general formula for estimating the age of a tree is the diameter of the tree multiplied by a growth factor specific to the species.[]
Our theory that trees without date records are older seems to be supported, so we will exclude these entries from future plots involving dates. To make analysis easier, I will add a column with just the year planted.
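As a worked illustration of that formula, the sketch below multiplies a diameter by a growth factor. The function name and the growth factor value are placeholders chosen for illustration, not published figures for any species, and the diameter units would need to match whatever units a real growth factor table assumes.

```python
def estimate_age(diameter: float, growth_factor: float) -> float:
    """Rough tree age estimate: diameter times a species growth factor."""
    return diameter * growth_factor

# Hypothetical growth factor, for illustration only
example_growth_factor = 4.0

# For the same species, a larger diameter implies an older tree
print(estimate_age(10.0, example_growth_factor))
print(estimate_age(20.0, example_growth_factor))
```

The point is only that, for a given species, the estimated age scales with diameter, which is why larger diameters suggest older trees.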
# remove entries with no date_planted
trees_small = trees_df.dropna(subset=['date_planted'])
# create a new column with just year
trees_small = trees_small.assign(year_planted = trees_small['date_planted'].dt.year)
# number of trees planted over time
trees_time = alt.Chart(trees_small).mark_bar().encode(
alt.X('year_planted:O'),
alt.Y('count()'))
trees_time
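The counts behind a bar chart like this can also be inspected as a table. A minimal sketch on invented dates: drop rows without a date, extract the year with .dt.year, then count rows per year.

```python
import pandas as pd

# Toy stand-in for trees_df; dates are invented for illustration
toy = pd.DataFrame({
    "date_planted": pd.to_datetime(
        ["1992-02-04", "1999-11-12", "1999-03-01", None]
    ),
})

# Drop rows without a date, then derive the year as an integer column
dated = toy.dropna(subset=["date_planted"])
dated = dated.assign(year_planted=dated["date_planted"].dt.year)

# Trees planted per year, in chronological order
per_year = dated["year_planted"].value_counts().sort_index().to_dict()
print(per_year)
```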
4. Click to filter¶
Let’s make this chart clickable so we can filter our top 10 tree species by year.
click_year = alt.selection_multi(encodings=['x'], on='click')
click_trees_year = (trees_time.encode(
opacity=alt.condition(click_year, alt.value(1), alt.value(0.5)))
.properties(height=100, width=500)
.add_selection(click_year))
# select 10 most common trees based on year
species_select = (alt.Chart(trees_small).transform_filter(click_year).mark_bar().encode(
alt.Y('species_name:N', sort='x'),
alt.X('species_count:Q'),
).transform_aggregate(
species_count="count()",
groupby=["species_name"]
).transform_window(
rank='rank(species_count)',
sort=[alt.SortField("species_count", order="descending")]
).transform_filter((alt.datum.rank <= 10)).add_selection(click_year))
species_select & click_trees_year
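The chart's transforms filter to the clicked year, count trees per species, rank the counts, and keep ranks up to 10. A rough pandas equivalent for a single year is sketched below on invented rows. The names selected_year and top_n are hypothetical stand-ins for the clicked bar and the cutoff, and I am assuming that ties sharing the lowest rank is close to how the chart's rank behaves.

```python
import pandas as pd

# Toy stand-in for trees_small; values are invented for illustration
toy = pd.DataFrame({
    "year_planted": [2005, 2005, 2005, 2005, 2005, 2005, 2006],
    "species_name": ["RUBRUM", "RUBRUM", "RUBRUM", "ZUMI", "ZUMI", "NIGRA", "NIGRA"],
})

selected_year = 2005   # stands in for the clicked bar
top_n = 2              # the chart keeps 10

# Filter to one year, then count trees per species
counts = (
    toy[toy["year_planted"] == selected_year]
    .groupby("species_name")
    .size()
    .rename("species_count")
    .reset_index()
)

# Rank 1 = most common; tied counts share the lowest rank
counts["rank"] = counts["species_count"].rank(method="min", ascending=False)

# Keep the top-ranked species, most common first
top_species = (
    counts[counts["rank"] <= top_n]
    .sort_values("species_count", ascending=False)["species_name"]
    .tolist()
)
print(top_species)
```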
Interesting, there is less overlap in the top 10 species per year than I thought there would be. Next, I would like to look at the size of trees and how the method of planting affects it. To visualize this, I will use our top 10 data subset.
# Tree diameter vs height colored by species
tree_height = alt.Chart(trees_nan_small).mark_circle().encode(
alt.X('diameter:Q'),
alt.Y('height_range_id:Q'),
color='species_name:N'
)
tree_height
# facet our size chart by root barrier
tree_height.facet('root_barrier:N')
# facet tree size by side of street
tree_side = tree_height.properties(width=200).facet('street_side_name')
tree_side
5. Barriers to barriers¶
It looks like the side of the street a tree is planted on makes no difference to its size. However, trees planted with a root barrier do seem to be smaller. Let’s see if the trees with root barriers are younger than those without, using our full dataset.
Fig. 5.1 An example of a root barrier. These are used to prevent tree roots from damaging the sidewalk.¶
root_barrier = trees_time.encode(color="root_barrier:N")
root_barrier
It looks like most of the trees with root barriers were planted between 2004 and 2009. Let’s filter our data to include just those years and see if the pattern still holds.
tree_height_filter = alt.Chart(trees_small).transform_filter(
alt.FieldRangePredicate(field='year_planted', range=[2004, 2009])
).mark_circle().encode(
alt.X('diameter:Q'),
alt.Y('height_range_id:Q')
).properties(width=300).facet('root_barrier:N')
tree_height_filter
When we filter for just the years in which root barriers were commonly used, the size difference is much less pronounced. Our initial observation about root barriers may have arisen because only a small share of the trees have root barriers, and most of those were planted within this narrow window.
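The visual comparison can be backed by a number. A minimal sketch on invented rows: restrict to plantings from 2004 to 2009 inclusive, then compare the median diameter with and without a root barrier.

```python
import pandas as pd

# Toy stand-in for trees_small; values are invented for illustration
toy = pd.DataFrame({
    "year_planted": [1995, 2005, 2006, 2007, 2008, 2015],
    "root_barrier": ["N", "Y", "N", "Y", "N", "N"],
    "diameter": [30.0, 6.0, 7.0, 8.0, 9.0, 3.0],
})

# Keep only plantings from 2004 to 2009 (both ends included)
window = toy[toy["year_planted"].between(2004, 2009)]

# Median diameter for trees with (Y) and without (N) a root barrier
median_by_barrier = window.groupby("root_barrier")["diameter"].median().to_dict()
print(median_by_barrier)
```

In this toy data the large tree from 1995 is excluded by the year filter, which is the kind of effect that can shrink an apparent size gap.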
Now, let’s see how the trees are distributed over Vancouver. As part of this course, code was provided to create a base map of Vancouver.
# load data to make a map of vancouver (code provided)
url_geojson = 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
data_geojson_remote = alt.Data(url=url_geojson, format=alt.DataFormat(property='features',type='json'))
data_geojson_remote
Data({
format: DataFormat({
property: 'features',
type: 'json'
}),
url: 'https://raw.githubusercontent.com/UBC-MDS/exploratory-data-viz/main/data/local-area-boundary.geojson'
})
# base map of Vancouver (code provided)
vancouver_map = alt.Chart(data_geojson_remote).mark_geoshape(
color = 'white', opacity= 0.5, stroke='black').encode(
).project(type='identity', reflectY=True)
vancouver_map
# Map location of all trees with a planting date
points = alt.Chart(trees_small).mark_circle(size=20).encode(
longitude='longitude',
latitude='latitude',
).project(type= 'identity', reflectY=True)
point_map = (vancouver_map + points)
point_map
To see how the distribution changes over time I am going to use the clickable year chart we made earlier.
point_map = point_map.encode(
opacity=alt.condition(click_year, alt.value(1), alt.value(0.1)),
color="species_name:N"
).add_selection(click_year)
point_map & click_trees_year
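A numeric complement to the map, sketched on invented rows: counting the distinct neighbourhoods that received trees in each year gives a rough sense of how widely each year's plantings are spread. This is only one possible summary and is not part of the original analysis.

```python
import pandas as pd

# Toy stand-in for trees_small; values are invented for illustration
toy = pd.DataFrame({
    "year_planted": [1990, 1990, 1990, 2010, 2010],
    "neighbourhood_name": ["Sunset", "Sunset", "Killarney", "Riley Park", "Sunset"],
})

# Number of distinct neighbourhoods with at least one planting, per year
neighbourhoods_per_year = (
    toy.groupby("year_planted")["neighbourhood_name"].nunique().to_dict()
)
print(neighbourhoods_per_year)
```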
6. Conclusion¶
Interesting, over the years the distribution seems to be spread out evenly. I would have guessed that the street tree program would have started in a few neighbourhoods and branched out from there. There also don’t seem to be any clusters of particular species in neighbourhoods, but it is hard to tell with so many species to consider. For the analysis report, I think it will be interesting to explore the distribution of species planted over time and space using both time charts and a map. Linking our top 10 species per year chart will make the species distribution much easier to visualize. I am also very interested in our findings about the size of trees and root barriers, so I will include those in our report as well.